5.2 MLE

1 MLE

For a generic dominated family $\mathcal{P}=\{P_\theta : \theta\in\Theta\}$ with densities $p_\theta$, a simple estimator for $\theta$ is the maximum likelihood estimator (MLE): $\hat\theta_{\mathrm{MLE}}(X)=\operatorname{argmax}_{\theta\in\Theta} p_\theta(X)=\operatorname{argmax}_{\theta\in\Theta} \ell(\theta;X)$, where $\ell(\theta;X)=\log p_\theta(X)$ is the log-likelihood.
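As a minimal numerical sketch (my example, not from the text): for $X_1,\dots,X_n\overset{\text{i.i.d.}}{\sim}N(\theta,1)$ the log-likelihood is maximized at the sample mean, so a crude grid argmax should agree with the closed form.

```python
import numpy as np

# MLE for N(theta, 1) by direct maximization of the log-likelihood;
# the known closed-form answer for this model is the sample mean.
rng = np.random.default_rng(0)
theta0 = 1.5
x = rng.normal(theta0, 1.0, size=500)

def log_lik(theta):
    # l(theta; X) = sum_i log p_theta(X_i), dropping the additive constant
    return -0.5 * np.sum((x - theta) ** 2)

grid = np.linspace(-5.0, 5.0, 20001)          # crude argmax over a fine grid
theta_mle = grid[np.argmax([log_lik(t) for t in grid])]
closed_form = x.mean()                        # closed-form MLE for this model
```

The grid argmax matches the closed form up to the grid spacing; in realistic models the argmax is computed by numerical optimization rather than a grid.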

The above example shows that the MLE can have embarrassing finite-sample performance despite being asymptotically optimal.

Proposition

If $P(B_n)\to 0$, $X_n\xrightarrow{d} X$, and $Z_n$ is arbitrary, then $X_n\mathbf{1}_{B_n^c}+Z_n\mathbf{1}_{B_n}\xrightarrow{d} X$.
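A simulation sketch of the proposition (my construction, not from the text): $X_n$ is a standardized sample mean (so $X_n\xrightarrow{d}N(0,1)$ by the CLT), $B_n$ is a "bad" event with $P(B_n)=1/n$, and $Z_n=10^6$ is arbitrary garbage on $B_n$; the patched variable has nearly the same distribution.

```python
import numpy as np

# Patching X_n with garbage on a vanishing-probability event B_n
# does not change its limiting distribution.
rng = np.random.default_rng(1)
n, reps = 1000, 5000
u = rng.uniform(size=(reps, n))
xn = (u.mean(axis=1) - 0.5) * np.sqrt(12.0 * n)   # approx N(0, 1) by CLT
bn = rng.uniform(size=reps) < 1.0 / n             # indicator of B_n, P(B_n) = 1/n
yn = np.where(bn, 1e6, xn)                        # X_n 1_{B_n^c} + Z_n 1_{B_n}
frac_x = float((xn <= 0).mean())                  # empirical CDFs at 0...
frac_y = float((yn <= 0).mean())                  # ...nearly agree
```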

2 Asymptotic Efficiency

The nice behavior of MLE we found in the exponential family case generalizes to a much broader class of models.

Setting: $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} p_\theta(x)$, $\theta\in\Theta\subseteq\mathbb{R}^d$, with $p_\theta$ "smooth" in $\theta$.
Let $\ell_1(\theta;X_i)=\log p_\theta(X_i)$ and $\ell_n(\theta;X)=\sum_{i=1}^n \ell_1(\theta;X_i)$. Then $$J_1(\theta)=\operatorname{Var}_\theta\big(\nabla \ell_1(\theta;X_i)\big)=-E_\theta\big[\nabla^2 \ell_1(\theta;X_i)\big],\qquad J_n(\theta)=\operatorname{Var}_\theta\big(\nabla \ell_n(\theta;X)\big)=nJ_1(\theta).$$
We say an estimator $\hat\theta_n$ is asymptotically efficient if $\sqrt{n}(\hat\theta_n-\theta)\xrightarrow{d} N\big(0,J_1(\theta)^{-1}\big)$ under $P_\theta$.
Delta method for a differentiable estimand $g(\theta)$: $\sqrt{n}\big(g(\hat\theta_n)-g(\theta)\big)\xrightarrow{d} N\big(0,\nabla g(\theta)^T J_1(\theta)^{-1}\nabla g(\theta)\big)$ under $P_\theta$, so $g(\hat\theta_n)$ also achieves the CRLB for $g(\theta)$ if $\hat\theta_n$ does.
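A sketch of the delta method (my example, not from the text): for $X_i\sim\text{Exp}(\text{rate }\theta_0)$ the MLE of the rate is $1/\bar X$, and $g(\theta)=1/\theta$ maps it back to the mean. Here $J_1(\theta)=1/\theta^2$ and $g'(\theta)=-1/\theta^2$, so the delta-method asymptotic variance is $g'(\theta)^2 J_1(\theta)^{-1}=(1/\theta^4)\cdot\theta^2=1/\theta^2$.

```python
import numpy as np

# Monte Carlo check that sqrt(n)(g(theta_hat) - g(theta0)) has sd 1/theta0.
rng = np.random.default_rng(2)
theta0, n, reps = 2.0, 400, 4000
x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
g_hat = x.mean(axis=1)                      # g(theta_hat) = 1/(1/xbar) = xbar
delta_sd = 1.0 / theta0                     # delta-method asymptotic sd
emp_sd = float(np.sqrt(n) * g_hat.std())    # Monte Carlo sd of sqrt(n)(g_hat - g(theta0))
```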

3 Asymptotic Distribution of MLE

Under mild conditions, $\hat\theta_{\mathrm{MLE}}$ is asymptotically Gaussian and efficient. We will be interested in $\ell_n(\theta;X)$ as a function of $\theta$. Notate the "true" value as $\theta_0$ (i.e. $X\sim P_{\theta_0}$).
Then for $\theta_0\in\Theta$, $\nabla\ell_1(\theta_0;X_i)\overset{\text{i.i.d.}}{\sim}(0,J_1(\theta_0))$, so $$\frac{1}{\sqrt n}\nabla\ell_n(\theta_0;X)=\sqrt n\cdot\frac1n\sum_{i=1}^n \nabla\ell_1(\theta_0;X_i)\xrightarrow{d} N\big(0,J_1(\theta_0)\big),\qquad \frac1n\nabla^2\ell_n(\theta_0;X)\xrightarrow{P_{\theta_0}} E_{\theta_0}\nabla^2\ell_1(\theta_0;X_i)=-J_1(\theta_0).$$
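Both building blocks can be checked numerically (my example, not from the text): for Bernoulli$(\theta_0)$, $\ell_1=x\log\theta+(1-x)\log(1-\theta)$, so the score at $\theta_0$ is $x/\theta_0-(1-x)/(1-\theta_0)$ and $J_1(\theta_0)=1/(\theta_0(1-\theta_0))$.

```python
import numpy as np

# The score at theta0 has mean 0 and variance J_1(theta0); the averaged
# negative Hessian converges to the same J_1 (the information identity).
rng = np.random.default_rng(3)
theta0, n = 0.3, 200000
x = rng.binomial(1, theta0, size=n).astype(float)
score = x / theta0 - (1.0 - x) / (1.0 - theta0)          # d/dtheta l1 at theta0
hess = -x / theta0**2 - (1.0 - x) / (1.0 - theta0)**2    # d^2/dtheta^2 l1 at theta0
J1 = 1.0 / (theta0 * (1.0 - theta0))
mean_score = float(score.mean())
var_score = float(score.var())
neg_mean_hess = float(-hess.mean())
```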

4 Consistency of MLE

$X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} p_{\theta_0}$, $\hat\theta_n\in\operatorname{argmax}_{\theta\in\Theta}\ell_n(\theta;X)$.[1] The question is: when does $\hat\theta_n\xrightarrow{p}\theta_0$?

Recall the KL divergence: $D_{\mathrm{KL}}(\theta_0\|\theta)=E_{\theta_0}\log\frac{p_{\theta_0}(X_i)}{p_\theta(X_i)}\ge 0$. Let $W_i(\theta)=\ell_1(\theta;X_i)-\ell_1(\theta_0;X_i)$ and $\overline W_n=\frac1n\sum_{i=1}^n W_i$. Note $\hat\theta_n\in\operatorname{argmax}_{\theta\in\Theta}\overline W_n(\theta)$ too. $$\overline W_n(\theta)\xrightarrow{p} E_{\theta_0}W_i(\theta)=-D_{\mathrm{KL}}(\theta_0\|\theta)\le 0,\quad\text{with equality iff }\theta=\theta_0. \tag{4.1}$$ But this pointwise convergence is not enough:
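Before turning to uniform convergence, a quick numerical check of (4.1) (my example, not from the text): for $N(\theta,1)$, $D_{\mathrm{KL}}(\theta_0\|\theta)=(\theta-\theta_0)^2/2$, and $\overline W_n(\theta)$ should be close to $-D_{\mathrm{KL}}$, i.e. $\le 0$ with its maximum ($=0$) exactly at $\theta=\theta_0$.

```python
import numpy as np

# W_bar_n(theta) for the unit-variance normal location model.
rng = np.random.default_rng(4)
theta0, n = 1.0, 50000
x = rng.normal(theta0, 1.0, size=n)

def w_bar(theta):
    # mean of W_i(theta) = l1(theta;X_i) - l1(theta0;X_i)
    return float(np.mean(-0.5 * (x - theta) ** 2 + 0.5 * (x - theta0) ** 2))

def kl(theta):
    return 0.5 * (theta - theta0) ** 2   # closed-form KL for unit-variance normals

w_at_2 = w_bar(2.0)                      # should be close to -kl(2.0) = -0.5
```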

Convergence of function sequences

For compact $K$, let $C(K)=\{f:K\to\mathbb{R}\ \text{continuous}\}$. For $f\in C(K)$, let $\|f\|_\infty=\sup_{t\in K}|f(t)|$. We say $f_n\to f$ (resp. $f_n\xrightarrow{p} f$) in this norm if $\|f_n-f\|_\infty\to 0$ (resp. $\|f_n-f\|_\infty\xrightarrow{p} 0$).

Theorem (Uniform LLN)

Assume $K$ is compact and $W_1,W_2,\dots\in C(K)$ are i.i.d. with $E\|W_i\|_\infty<\infty$; let $\mu(t)=EW_i(t)$. Then $\mu\in C(K)$, and for every $\varepsilon>0$, $$P\Big(\Big\|\frac1n\sum_{i=1}^n W_i-\mu\Big\|_\infty>\varepsilon\Big)\to 0,$$
i.e. $\|\overline W_n-\mu\|_\infty\xrightarrow{p} 0$.
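A sketch of the uniform LLN in the same normal model (my example, not from the text): on the compact set $K=[-2,4]$, the sup-norm distance between $\overline W_n$ and $\mu(\theta)=-D_{\mathrm{KL}}(\theta_0\|\theta)$ is small for large $n$ (here it equals $|\bar X_n-\theta_0|\cdot\sup_{\theta\in K}|\theta-\theta_0|$, approximated on a grid).

```python
import numpy as np

# Sup-norm convergence of W_bar_n to mu over a compact set K.
rng = np.random.default_rng(5)
theta0, n = 1.0, 10000
x = rng.normal(theta0, 1.0, size=n)
grid = np.linspace(-2.0, 4.0, 601)           # compact K = [-2, 4]

w_bar = np.array([np.mean(-0.5 * (x - t) ** 2 + 0.5 * (x - theta0) ** 2)
                  for t in grid])
mu = -0.5 * (grid - theta0) ** 2             # pointwise limit -D_KL on K
sup_dev = float(np.max(np.abs(w_bar - mu)))  # ||W_bar_n - mu||_infty over the grid
```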

Theorem (Consistency of MLE for Compact Θ)

$X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} p_{\theta_0}$, where $\mathcal{P}$ has densities $p_\theta$, $\theta\in\Theta$. Assume:

  • $\log p_\theta(x)$ is continuous in $\theta$ for every $x\in\mathcal{X}$.
  • $\Theta$ is compact.
  • $E_{\theta_0}\big[\sup_{\theta\in\Theta}|W_i(\theta)|\big]<\infty$.
  • The model is identifiable ($P_\theta=P_{\theta_0}\Rightarrow\theta=\theta_0$).

Then $\hat\theta_n\xrightarrow{p}\theta_0$ for any $\hat\theta_n\in\operatorname{argmax}_{\theta\in\Theta}\ell_n(\theta;X)$.
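A sketch of the theorem at work (my example, not from the text): $N(\theta,1)$ with the compact parameter space $\Theta=[-3,3]$. The argmax of $\ell_n$ over $\Theta$ is the sample mean clipped to $\Theta$, and the error shrinks as $n$ grows.

```python
import numpy as np

# Consistency of the constrained MLE in a compact parameter space.
rng = np.random.default_rng(6)
theta0 = 1.0
errs = {}
for n in (100, 10000):
    x = rng.normal(theta0, 1.0, size=n)
    # argmax of l_n over Theta = [-3, 3] is the clipped sample mean
    theta_hat = float(np.clip(x.mean(), -3.0, 3.0))
    errs[n] = abs(theta_hat - theta0)
```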

We usually care about non-compact parameter spaces, so we need some extra assumption to get us there.

Corollary

Same assumptions, except now $\Theta=\mathbb{R}^d$ (non-compact); if there is some $R<\infty$ large enough that $P_{\theta_0}(\|\hat\theta_n-\theta_0\|>R)\to 0$, then $\hat\theta_n\xrightarrow{p}\theta_0$.

So the only thing we actually need to worry about is whether $\hat\theta_n$ is extremely far from $\theta_0$ with non-negligible probability.

Theorem (Asymptotic Distribution of MLE)

$X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} p_{\theta_0}$, where $\mathcal{P}$ has densities $p_\theta$, $\theta\in\Theta$.
Assume:

  • $\mathcal{P}$ is identifiable.
  • $\Theta$ is compact, with $\theta_0$ an interior point.
  • $E_{\theta_0}\big[\sup_{\theta\in\Theta}|W_i(\theta)|\big]<\infty$.
  • $\ell_1(\theta;X_i)=\log p_\theta(X_i)$ has two continuous derivatives in $\theta$.
  • $E_{\theta_0}\big[\sup_{\theta\in\Theta}\|\nabla^2\ell_1(\theta;X_i)\|\big]<\infty$.
  • $J_1(\theta_0)=-E_{\theta_0}\nabla^2\ell_1(\theta_0;X_i)$ is positive definite.

Then $\sqrt{n}(\hat\theta_n-\theta_0)\xrightarrow{d} N\big(0,J_1(\theta_0)^{-1}\big)$.
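A Monte Carlo sketch of the theorem (my example, not from the text): for Bernoulli$(\theta_0)$ the MLE is the sample mean and $J_1(\theta_0)^{-1}=\theta_0(1-\theta_0)$, so the empirical variance of $\sqrt n(\hat\theta_n-\theta_0)$ should match $J_1^{-1}$.

```python
import numpy as np

# Asymptotic normality of the Bernoulli MLE: sqrt(n)(theta_hat - theta0)
# should be approximately N(0, theta0(1 - theta0)).
rng = np.random.default_rng(7)
theta0, n, reps = 0.3, 500, 5000
theta_hat = rng.binomial(n, theta0, size=reps) / n   # MLE = sample mean
z = np.sqrt(n) * (theta_hat - theta0)
inv_J1 = theta0 * (1.0 - theta0)                     # J_1(theta0)^{-1}
emp_var = float(z.var())
```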


  1. This will be OK if $\hat\theta_n$ comes close to maximizing $\ell_n$. ↩︎